Abstract

K-means is undoubtedly the most widely used partitional clustering algorithm. Unfortunately, due to its gradient 
descent nature, this algorithm is highly sensitive to the initial placement of the cluster centers. Numerous
initialization methods have been proposed to address this problem. In this paper, we first present an overview of these
methods with an emphasis on their computational efficiency. We then compare eight commonly used linear time 
complexity initialization methods on a large and diverse collection of data sets using various performance criteria. 
Finally, we analyze the experimental results using non-parametric statistical tests and provide recommendations 
for practitioners. We demonstrate that popular initialization methods often perform poorly and that there are in 
fact strong alternatives to these methods. 



1. Introduction 

Clustering, the unsupervised classification of patterns into groups, is one of the most important tasks in
exploratory data analysis. Primary goals of clustering include gaining insight into data (detecting anomalies,
identifying salient features, etc.), classifying data, and compressing data. Clustering has a long and rich history
in a variety of scientific disciplines including anthropology, biology, medicine, psychology, statistics, mathematics,
engineering, and computer science. As a result, numerous clustering algorithms have been proposed since the early
1950s [?].

Clustering algorithms can be broadly classified into two groups: hierarchical and partitional [?]. Hierarchical
algorithms recursively find nested clusters either in a top-down (divisive) or bottom-up (agglomerative) fashion.
In contrast, partitional algorithms find all the clusters simultaneously as a partition of the data and do not
impose a hierarchical structure. Most hierarchical algorithms have quadratic or higher complexity in the number of
data points [1] and therefore are not suitable for large data sets, whereas partitional algorithms often have lower
complexity.

Given a data set $X = \{x_1, x_2, \ldots, x_N\}$ in $\mathbb{R}^D$, i.e., N points (vectors) each with D attributes (components),
hard partitional algorithms divide X into K exhaustive and mutually exclusive clusters $\mathcal{P} = \{P_1, P_2, \ldots, P_K\}$,
$\bigcup_{i=1}^{K} P_i = X$, $P_i \cap P_j = \emptyset$ for $1 \le i \ne j \le K$. These algorithms usually generate clusters by optimizing a criterion
function. The most intuitive and frequently used criterion function is the Sum of Squared Error (SSE) given by:

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x_j \in P_i} \lVert x_j - c_i \rVert_2^2 \tag{1}$$

where $\lVert \cdot \rVert_2$ denotes the Euclidean ($L_2$) norm and $c_i = \frac{1}{|P_i|} \sum_{x_j \in P_i} x_j$ is the centroid of cluster $P_i$ whose cardinality
is $|P_i|$. The optimization of (1) is often referred to as the minimum SSE clustering (MSSC) problem.
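As a small illustration of criterion (1), the following Python sketch computes the SSE of a given hard partition. The function names and the toy data are assumptions made for this example only; they do not come from the implementation used in this study.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sse(clusters):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        for x in cluster:
            total += sum((xd - cd) ** 2 for xd, cd in zip(x, c))
    return total


# Toy partition (hypothetical data): two clusters of two points each.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 10.0), (10.0, 12.0)]]
print(sse(clusters))  # each point is 1 unit from its centroid, so the SSE is 4.0
```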
The number of ways in which a set of N objects can be partitioned into K non-empty groups is given by Stirling
numbers of the second kind:

$$\mathcal{S}(N, K) = \frac{1}{K!} \sum_{i=0}^{K} (-1)^{K-i} \binom{K}{i} i^N$$

which can be approximated by $K^N / K!$. It can be seen that a complete enumeration of all possible clusterings to
determine the global minimum of (1) is clearly computationally prohibitive except for very small data sets [3]. In
fact, this non-convex optimization problem is proven to be NP-hard even for K = 2 [4] or D = 2 [5]. Consequently,
various heuristics have been developed to provide approximate solutions to this problem. Among these heuristics,
Lloyd's algorithm [?], often referred to as the (batch) k-means algorithm, is the simplest and most commonly used
one. This algorithm starts with K arbitrary centers, typically chosen uniformly at random from the data points.
Each point is assigned to the nearest center and then each center is recalculated as the mean of all points assigned
to it. These two steps are repeated until a predefined termination criterion is met.
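The two alternating steps of batch k-means can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this study: the name `lloyd` is hypothetical, the run stops when the assignments no longer change (one possible termination criterion), and an empty cluster simply keeps its previous center.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, centers, max_iter=100):
    """Batch k-means: alternate assignment and center update until labels stabilize."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
                      for p in points]
        if new_labels == labels:  # assumed termination criterion
            break
        labels = new_labels
        # Update step: each center becomes the mean of its assigned points.
        for i in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its previous center (assumption)
                centers[i] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
centers, labels = lloyd(points, [(0.0, 0.0), (0.0, 1.0)])  # deliberately poor start
print(centers, labels)
```

Even from the poor start above, where both initial centers lie in the same group, this toy example recovers the two obvious groups; on harder data the outcome depends strongly on the initial centers.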

The k-means algorithm is undoubtedly the most widely used partitional clustering algorithm. Its popularity
can be attributed to several reasons. First, it is conceptually simple and easy to implement. Virtually every data
mining software package includes an implementation of it. Second, it is versatile, i.e., almost every aspect of the algorithm
(initialization, distance function, termination criterion, etc.) can be modified. This is evidenced by hundreds of
publications over the last fifty years that extend k-means in various ways. Third, it has a time complexity that
is linear in N, D, and K (in general, $D \ll N$ and $K \ll N$). For this reason, it can be used to initialize more
expensive clustering algorithms such as expectation maximization [?], DBSCAN, and spectral clustering [10].
Furthermore, numerous sequential [11, 12] and parallel [13] acceleration techniques are available in the literature.
Fourth, it has a storage complexity that is linear in N, D, and K. In addition, there exist disk-based variants that
do not require all points to be stored in memory [14]. Fifth, it is guaranteed to converge [15] at a quadratic rate
[16]. Finally, it is invariant to data ordering, i.e., random shufflings of the data points.

On the other hand, k-means has several significant disadvantages. First, it requires the number of clusters, K,
to be specified a priori. The value of this parameter can be determined automatically by means of various cluster
validity measures [17]. Second, it can only detect compact, hyperspherical clusters that are well separated. This
can be alleviated by using a more general distance function such as the Mahalanobis distance, which permits the
detection of hyperellipsoidal clusters [3]. Third, due to its utilization of the squared Euclidean distance, it is sensitive
to noise and outlier points since even a few such points can significantly influence the means of their respective
clusters. This can be addressed by outlier pruning [?] or by using a more robust distance function such as the City-block ($L_1$)
distance. Fourth, due to its gradient descent nature, it often converges to a local minimum of the criterion function
[15]. For the same reason, it is highly sensitive to the selection of the initial centers. Adverse effects of improper
initialization include empty clusters, slower convergence, and a higher chance of getting stuck in bad local minima
[20]. Fortunately, all of these drawbacks except the first one can be remedied by using an adaptive initialization
method (IM).

In this study, we investigate some of the most popular IMs developed for the k-means algorithm. Our motivation 
is three-fold. First, a large number of IMs have been proposed in the literature and thus a systematic study that 
reviews and compares these methods is desirable. Second, these IMs can be used to initialize other partitional 
clustering algorithms such as fuzzy c-means and its variants and expectation maximization. Third, most of these 
IMs can be used independently of k-means as standalone clustering algorithms. 

This study differs from earlier studies of a similar nature [21, 22] in several respects: (i) a more comprehensive
overview of the existing IMs is provided, (ii) the experiments involve a larger set of methods and a significantly
more diverse collection of data sets, (iii) in addition to clustering effectiveness, computational efficiency is used
as a performance criterion, and (iv) the experimental results are analyzed more thoroughly using non-parametric
statistical tests.

The rest of the paper is organized as follows. Section 2 presents a survey of k-means IMs. Section 3 describes
the experimental setup. Section 4 presents the experimental results, while Section 5 gives the conclusions.



2. Initialization Methods for K-Means 



In this section, we briefly review some of the commonly used IMs with an emphasis on their time complexity
(with respect to N). In each complexity class, methods are presented in chronologically ascending order.

2.1. Linear Time-Complexity Initialization Methods

Forgy's method [23] assigns each point to one of the K clusters uniformly at random. The centers are then
given by the centroids of these initial clusters. This method has no theoretical basis, as such random clusters have
no internal homogeneity [24].
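Forgy's method can be sketched as follows. The function name `forgy_init` and the choice to report a cluster that receives no points as `None` are assumptions of this example.

```python
import random


def forgy_init(points, k, rng):
    """Assign every point to one of k clusters uniformly at random; return the centroids."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        j = rng.randrange(k)  # uniform random cluster label
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    # A cluster that received no points has no centroid; None marks it (assumption).
    return [tuple(s / counts[j] for s in sums[j]) if counts[j] else None
            for j in range(k)]


points = [(float(x), float(y)) for x in range(10) for y in range(10)]
print(forgy_init(points, 3, random.Random(0)))
```

Because each random cluster is roughly a uniform sample of the whole data set, the resulting centers tend to lie close to the overall centroid, which is consistent with the lack of internal homogeneity noted above.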

Jancey's method [25] assigns to each center a synthetic point randomly generated within the data space. Unless
the data set fills the space, some of these centers may be quite distant from any of the points, which might lead
to the formation of empty clusters.

MacQueen [26] proposed two different methods. The first one, which is the default option in the Quick Cluster
procedure of IBM SPSS Statistics [27], takes the first K points in X as the centers. An obvious drawback of this
method is its sensitivity to data ordering. The second method chooses the centers randomly from the data points.
The rationale behind this method is that random selection is likely to pick points from dense regions, i.e., points
that are good candidates to be centers. However, there is no mechanism to avoid choosing outliers or points that
are too close to each other [24]. Performing multiple runs of this method is the standard way of initializing k-means [?]. It
should be noted that this second method is often mistakenly attributed to Forgy [23].

Ball and Hall's method [28] takes the centroid of X, i.e., $\bar{X} = \frac{1}{N} \sum_{j=1}^{N} x_j$, as the first center. It then traverses
the points in arbitrary order and takes a point as a center if it is at least T units apart from the previously selected
centers until K centers are obtained. The purpose of the distance threshold T is to ensure that the seed points are
well separated. However, it is difficult to decide on an appropriate value for T. In addition, the method is sensitive
to data ordering.
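The threshold-based scan of Ball and Hall's method might look as follows in Python. The name `ball_hall_init` is hypothetical, and returning fewer than K centers when T is too large is an assumption about how to handle that case.

```python
import math


def ball_hall_init(points, k, threshold):
    """Start from the data centroid, then keep every scanned point that is at least
    `threshold` away from all centers chosen so far, until k centers exist."""
    n = len(points)
    centers = [tuple(sum(col) / n for col in zip(*points))]
    for p in points:  # scan order matters: the method is order-sensitive
        if len(centers) == k:
            break
        if all(math.dist(p, c) >= threshold for c in centers):
            centers.append(tuple(p))
    return centers  # may hold fewer than k centers if the threshold is too large


points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
print(ball_hall_init(points, 3, 4.0))    # centroid, then (0, 0), then (10, 0)
print(ball_hall_init(points, 3, 100.0))  # threshold too large: only the centroid
```

The second call illustrates the difficulty of choosing T: an overly large threshold leaves the method unable to produce K seeds.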

The Simple Cluster Seeking method [29] is identical to Ball and Hall's method with the exception that the first
point in X is taken as the first center. This method is used in the FASTCLUS procedure of SAS [30].

Späth's method [31] is similar to Forgy's method with the exception that the points are assigned to the clusters
in a cyclical fashion, i.e., the j-th ($j \in \{1, 2, \ldots, N\}$) point is assigned to the $((j - 1) \bmod K + 1)$-th cluster. In
contrast to Forgy's method, this method is sensitive to data ordering.

The maximin method [32, 33] chooses the first center $c_1$ arbitrarily and the i-th ($i \in \{2, 3, \ldots, K\}$) center $c_i$ is chosen
to be the point that has the greatest minimum-distance to the previously selected centers, i.e., $c_1, c_2, \ldots, c_{i-1}$. This
method was originally developed as a 2-approximation to the K-center clustering problem¹. It should be noted
that, motivated by a vector quantization application, Katsavounidis et al.'s variant [33] takes the point with the
greatest Euclidean norm as the first center.
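A minimal Python sketch of the maximin selection rule is given below. The name `maximin_init` and the `first` parameter (the index of the arbitrarily chosen first center) are assumptions of this example. Keeping a running array of minimum-distances makes each round linear in N.

```python
import math


def maximin_init(points, k, first=0):
    """Pick points[first], then repeatedly pick the point farthest from its nearest center."""
    centers = [points[first]]
    # min_dist[j] = distance from points[j] to the closest center selected so far
    min_dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        j = max(range(len(points)), key=min_dist.__getitem__)
        centers.append(points[j])
        min_dist = [min(d, math.dist(p, points[j])) for d, p in zip(min_dist, points)]
    return centers


points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (4.0, 0.0)]
print(maximin_init(points, 3))  # (0, 0), then (10, 0), then (4, 0)
```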

Al-Daoud's density-based method [34] first uniformly partitions the data space into M disjoint hypercubes. It
then randomly selects $K N_m / N$ points from hypercube m ($m \in \{1, 2, \ldots, M\}$) to obtain a total of K centers ($N_m$
is the number of points in hypercube m). There are two main disadvantages associated with this method. First,
it is difficult to decide on an appropriate value for M. Second, the method has a storage complexity of $O(2^{BD})$,
where B is the number of bits allocated to each attribute.

Bradley and Fayyad's method [?] starts by randomly partitioning the data set into J subsets. These subsets are
clustered using k-means initialized by MacQueen's second method, producing J sets of intermediate centers each
with K points. These center sets are combined into a superset, which is then clustered by k-means J times, each
time initialized with a different center set. Members of the center set that give the least SSE are then taken as the
final centers.

Pizzuti et al. [35] improved upon Al-Daoud's density-based method using a multiresolution grid approach. Their
method starts with $2^D$ hypercubes and iteratively splits these as the number of points they receive increases. Once
the splitting phase is completed, the centers are chosen from the densest hypercubes.

The k-means++ method [36] interpolates between MacQueen's second method and the maximin method. It
chooses the first center randomly and the i-th ($i \in \{2, 3, \ldots, K\}$) center is chosen to be $x' \in X$ with a probability of
$\frac{md(x')^2}{\sum_{j=1}^{N} md(x_j)^2}$, where $md(x)$ denotes the minimum-distance from a point x to the previously selected centers. This
method yields a $\Theta(\log K)$ approximation to the MSSC problem. The greedy k-means++ method probabilistically
selects $\log(K)$ centers in each round and then greedily selects the center that most reduces the SSE. This modification
aims to avoid the unlikely event of choosing two centers that are close to each other.
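The sampling rule of (non-greedy) k-means++ can be sketched in Python as follows. The name `kmeanspp_init` is an assumption of this example, and the sketch relies on the standard library's `random.Random.choices`, which draws an item with probability proportional to its weight. It does not cover the greedy variant.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_init(points, k, rng):
    """Each new center is drawn with probability proportional to md(x)^2."""
    centers = [rng.choice(points)]                   # first center: uniform at random
    md2 = [sq_dist(p, centers[0]) for p in points]   # squared minimum-distances
    while len(centers) < k:
        # Raises ValueError if every remaining weight is zero, i.e. if all
        # points coincide with already selected centers.
        c = rng.choices(points, weights=md2, k=1)[0]
        centers.append(c)
        md2 = [min(m, sq_dist(p, c)) for m, p in zip(md2, points)]
    return centers


points = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.1, 0.0)]
print(kmeanspp_init(points, 2, random.Random(0)))
```

Points that coincide with an existing center have weight zero and are therefore never selected again, while distant points are strongly favored.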

¹ Given a set of N points in a metric space, the goal of K-center clustering is to find K representative points (centers) such that the
maximum distance of a point to a center is minimized.

The PCA-Part method [37] uses a divisive hierarchical approach based on PCA (Principal Component Analysis)
[38]. Starting from an initial cluster that contains the entire data set, the method iteratively selects the cluster with
the greatest SSE and divides it into two subclusters using a hyperplane that passes through the cluster centroid
and is orthogonal to the principal eigenvector of the cluster covariance matrix. This procedure is repeated until K
clusters are obtained. The centers are then given by the centroids of these clusters. The Var-Part method [37] is
an approximation to PCA-Part, where the covariance matrix of the cluster to be split is assumed to be diagonal.
In this case, the splitting hyperplane is orthogonal to the coordinate axis with the greatest variance.
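The Var-Part splitting rule lends itself to a compact Python sketch. The name `var_part_init` is hypothetical. Sending points that lie exactly on the splitting hyperplane to the lower side, and stopping early when a cluster cannot be split, are assumptions of this example.

```python
def var_part_init(points, k):
    """Repeatedly split the cluster with the largest SSE at its mean along
    the coordinate axis with the greatest variance."""
    def mean(c):
        return tuple(sum(col) / len(c) for col in zip(*c))

    def axis_sse(c):
        # Per-axis sum of squared deviations; proportional to the per-axis variance.
        m = mean(c)
        return [sum((p[d] - m[d]) ** 2 for p in c) for d in range(len(m))]

    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=lambda c: sum(axis_sse(c)))
        per_axis = axis_sse(worst)
        d = per_axis.index(max(per_axis))
        cut = mean(worst)[d]
        left = [p for p in worst if p[d] <= cut]  # boundary points go left (assumption)
        right = [p for p in worst if p[d] > cut]
        if not right:  # zero variance everywhere: no further split is possible
            break
        clusters.remove(worst)
        clusters += [left, right]
    return [mean(c) for c in clusters]


points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
print(var_part_init(points, 2))  # one split at x = 5.5
```

The method is deterministic: repeated calls on the same data always return the same centers, so a single run suffices.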

Lu et al.'s method [39] uses a two-phase pyramidal approach. The attributes of each point are first encoded
as integers using $2^Q$-level quantization, where Q is a resolution parameter. These integer points are considered to
be at level 0 of the pyramid. In the bottom-up phase, starting from level 0, neighboring data points at level k
($k \in \{0, 1, \ldots\}$) are averaged to obtain weighted points at level k + 1 until at least 20K points are obtained. Data
points at the highest level are refined using k-means initialized with the K points with the largest weights. In the
top-down phase, starting from the highest level, centers at level k + 1 are projected onto level k and then used to
initialize the k-th level clustering. The top-down phase terminates when level 0 is reached. The centers at this level
are then inverse quantized to obtain the final centers. The performance of this method degrades with increasing
dimensionality [39].

Onoda et al.'s method [40] first calculates K Independent Components (ICs) [41] of X and then chooses the i-th
($i \in \{1, 2, \ldots, K\}$) center as the point that has the least cosine distance from the i-th IC.

2.2. Loglinear Time-Complexity Initialization Methods



Hartigan's method [42] first sorts the points according to their distances to $\bar{X}$. The i-th ($i \in \{1, 2, \ldots, K\}$)
center is then chosen to be the $(1 + (i - 1)N/K)$-th point. This method is an improvement over MacQueen's first
method in that it is invariant to data ordering and is more likely to produce seeds that are well separated. The
computational cost of this method is dominated by the complexity of sorting, which is $O(N \log N)$.
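A short Python sketch of Hartigan's selection rule follows. The name `hartigan_init` is hypothetical, and rounding the rank $1 + (i - 1)N/K$ down to an integer is an assumption, since the text does not say how fractional ranks are handled.

```python
import math


def hartigan_init(points, k):
    """Sort points by distance to the data centroid and take every (N/K)-th one."""
    n = len(points)
    mean = tuple(sum(col) / n for col in zip(*points))
    ordered = sorted(points, key=lambda p: math.dist(p, mean))
    # 1-based rank 1 + (i - 1) * N / K for i = 1..K, i.e. 0-based index (i - 1) * N // K.
    return [ordered[(i * n) // k] for i in range(k)]


points = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
print(hartigan_init(points, 2))
print(hartigan_init(list(reversed(points)), 2))  # same seeds: order-invariant
```

Barring ties in the distances, the result does not depend on the order in which the points are supplied.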

Al-Daoud's variance-based method [43] first sorts the points on the attribute with the greatest variance and
then partitions them into K groups along the same dimension. The centers are then chosen to be the points that
correspond to the medians of these groups. Note that this method disregards all attributes but one and therefore
is likely to be effective only for data sets in which the variability is mostly on one dimension.
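The following Python sketch illustrates this variance-based selection. The name `aldaoud_variance_init` is hypothetical. Two details are assumptions, as the description above does not fix them: the K groups are formed with (nearly) equal point counts, and the lower median is used for groups of even size. The sketch also assumes N is at least K.

```python
def aldaoud_variance_init(points, k):
    """Sort on the highest-variance attribute, cut into k groups, return group medians."""
    n, dim = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dim)]
    variances = [sum((p[d] - means[d]) ** 2 for p in points) / n for d in range(dim)]
    axis = variances.index(max(variances))
    ordered = sorted(points, key=lambda p: p[axis])
    centers = []
    for g in range(k):
        lo, hi = g * n // k, (g + 1) * n // k        # equal-count groups (assumption)
        centers.append(ordered[(lo + hi - 1) // 2])  # lower median of the group
    return centers


points = [(0.0, 5.0), (1.0, 5.0), (2.0, 5.0), (10.0, 5.0), (11.0, 5.0), (12.0, 5.0)]
print(aldaoud_variance_init(points, 2))  # medians of the two halves along the first axis
```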



Redmond and Heneghan's method [44] first constructs a kd-tree of the data points to perform density estimation
and then uses a modified maximin method to select K centers from densely populated leaf buckets. The
computational cost of this method is dominated by the complexity of kd-tree construction, which is $O(N \log N)$.

The ROBIN (ROBust INitialization) method [45] uses a local outlier factor (LOF) [46] to avoid selecting outlier
points as centers. In iteration i ($i \in \{1, 2, \ldots, K\}$), the method first sorts the data points in decreasing order of their
minimum-distance to the previously selected centers. It then traverses the points in sorted order and selects the
first point that has an LOF value close to 1 as the i-th center. The computational cost of this method is dominated
by the complexity of sorting, which is $O(N \log N)$.

2.3. Quadratic-Complexity Initialization Methods

Astrahan's method [47] uses two distance thresholds $d_1$ and $d_2$. It first calculates the density of each point as the
number of points within a distance of $d_1$. The points are sorted in decreasing order by their densities and the highest
density point is chosen as the first center. Subsequent centers are chosen in order of decreasing density subject
to the condition that each new center be at least at a distance of $d_2$ from the previously selected centers. This
procedure is continued until no more centers can be chosen. Finally, if more than K centers are chosen, hierarchical
clustering is used to group the centers until only K of them remain. The main problem with this method is that it
is very sensitive to the values of $d_1$ and $d_2$. For example, if $d_1$ is too small there may be many isolated points with
zero density, whereas if it is too large a few centers will cover the entire data set [24].

Lance and Williams [48] suggested that the output of a hierarchical clustering algorithm can be used to initialize
k-means. Despite the fact that such algorithms often have quadratic or higher complexity, this method is highly 
recommended in the statistics literature possibly due to the limited size of the data sets in this field. 

Kaufman and Rousseeuw's method [49] takes $\bar{X}$ as the first center and the i-th ($i \in \{2, 3, \ldots, K\}$) center is chosen
to be the point that most reduces the SSE. Since pairwise distances between the data points need to be calculated
in each iteration, the time complexity of this method is $O(N^2)$.

Cao et al. [50] formalized Astrahan's density-based method within the framework of a neighborhood-based
rough set model. In this model, the $\epsilon$-neighborhood of a point is defined as the set of points within $\epsilon$ distance
from it according to a particular distance measure. Based on this neighborhood model, the concepts of cohesion
and coupling are defined. The former is a measure of the centrality of a point with respect to its neighborhood,
whereas the latter is a measure of separation between two neighborhoods. The method first sorts the data points
in decreasing order of their cohesion and takes the point with the greatest cohesion as the first center. It then
traverses the points in sorted order and takes the first point that has a coupling of less than $\epsilon$ with the previously
selected centers as the i-th ($i \in \{2, 3, \ldots, K\}$) center. The computational cost of this method is dominated by the
complexity of the $\epsilon$-neighborhood calculations, which is $O(N^2)$.

2.4. Other Initialization Methods



The binary-splitting method [51] takes $\bar{X}$ as the first center. In iteration t ($t \in \{1, 2, \ldots, \log_2 K\}$), each of the
existing $2^{t-1}$ centers is split into two new centers by subtracting and adding a fixed perturbation vector $\epsilon$, i.e.,
$c_i - \epsilon$ and $c_i + \epsilon$ ($i \in \{1, 2, \ldots, 2^{t-1}\}$). These $2^t$ new centers are then refined using k-means. There are two main
disadvantages associated with this method. First, there is no guidance on the selection of a proper value for $\epsilon$,
which determines the direction of the split [52]. Second, the method is computationally demanding since after each
iteration k-means has to be run for the entire data set.

The directed-search binary-splitting method [?] is an improvement over the binary-splitting method in that it
determines the value of $\epsilon$ using PCA. However, it has even higher computational requirements due to the calculation
of the principal eigenvector in each iteration.

The global k-means method [?] takes $\bar{X}$ as the first center. In iteration i ($i \in \{1, 2, \ldots, K - 1\}$) it considers
each of the N points in turn as a candidate for the (i + 1)-st center and runs k-means with i + 1 centers on the entire
data set. This method is computationally prohibitive for large data sets as it involves N(K − 1) runs of k-means
on the entire data set.

It should be noted that the two splitting methods and the global k-means method are not initialization methods 
per se. These methods can be considered as complete clustering methods that utilize k-means as a local search 
procedure. For this reason, to the best of our knowledge, none of the initialization studies to date included these 
methods in their comparisons. 



We should also mention IMs based on metaheuristics such as simulated annealing [54] and genetic algorithms
[55]. These algorithms start from a random initial configuration (population) and use k-means to evaluate their
solutions in each iteration (generation). There are two main disadvantages associated with these methods. First,
they involve numerous parameters that are difficult to tune (initial temperature, cooling schedule, population size,
crossover/mutation probability, etc.) [?]. Second, due to the large search space, they often require a large number
of iterations, which renders them computationally prohibitive for all but the smallest data sets. Interestingly, with
the recent developments in combinatorial optimization algorithms, it is now feasible to obtain globally minimum
SSE clusterings for small data sets without resorting to metaheuristics [56].

2.5. Linear vs. Superlinear Initialization Methods

Based on the descriptions given above, it can be seen that superlinear methods often have more elaborate designs
when compared to linear ones. An interesting feature of the superlinear methods is that they are often deterministic,
which can be considered as an advantage especially when dealing with large data sets. In contrast, linear methods
are often non-deterministic and/or order-sensitive. As a result, it is common practice to perform multiple runs of
such methods and take the output of the run that produces the least SSE [?].

A frequently cited advantage of the more elaborate methods is that they often lead to faster k-means convergence,
i.e., require fewer iterations, and as a result the time gained during the clustering phase can offset the time lost
during the initialization phase [37, 44, 45]. This may be true when a standard implementation of k-means is used.
However, convergence speed may not be as important when a fast k-means variant is used, as such methods often
require significantly less time compared to a standard k-means implementation. In this study, we utilize a fast
k-means variant based on triangle inequality [57] and partial distance elimination [58] techniques. As will be seen in
§4, this fast and exact k-means implementation will diminish the computational efficiency differences among various
IMs. In other words, we will demonstrate that elaborate methods that lead to faster k-means convergence are not
necessarily more efficient than simple methods with slower convergence.



3. Experimental Setup 

3.1. Data Set Descriptions 

In order to obtain a comprehensive evaluation of various IMs, we conducted two sets of experiments. The first
experiment involved 32 commonly used real data sets with sizes ranging from 214 to 1,904,711 points. Most of these
data sets were obtained from the UCI Machine Learning Repository [59] (see Table 1). The second experiment
involved a large number of synthetic data sets with varying clustering complexity. We used a recent algorithm proposed
by Maitra and Melnykov [60] to generate these data sets. This algorithm involves the calculation of the exact overlap
($\omega$) between each cluster pair, measured in terms of their total probability of misclassification, and guided simulation
of Gaussian mixture components satisfying prespecified overlap characteristics. The algorithm was used with the
following parameters: mean overlap ($\bar{\omega} \in \{0.025, 0.05, 0.1, 0.2\}$), number of points ($N \in \{1024, 4096, 16384, 65536\}$),
number of attributes ($D \in \{2, 4, 8, 16, 32, 64\}$), and number of classes ($K' \in \{2, 4, 6, 8, 10, 12\}$).
The parameter $\bar{\omega}$ denotes the mean overlap between pairs of clusters. However, we observed that two synthetic
data sets with the same $\bar{\omega}$ can have considerably different clustering complexity. Therefore, we quantified clustering
complexity using the following indirect approach. For each data set, we executed the k-means algorithm initialized
with the "true" centers given by the cluster generation algorithm and calculated the RAND, VD, and VI measures
(see §3.3) upon convergence. The average of these measures, $\Omega$, was taken as a quantitative indicator of clustering
complexity. Note that each of these normalized measures takes values from the [0, 1] interval. For RAND larger values
are better, whereas for VD and VI smaller values are better. Therefore, we inverted the RAND values by subtracting
them from 1 to make this measure compatible with the other two. Finally, using the aforementioned complexity
quantification scheme, we generated 4,096 synthetic data sets from each of the following complexity classes: easy
($0 < \Omega < 0.25$), moderate ($0.25 < \Omega < 0.5$), and difficult ($0.5 < \Omega < 1$). The total number of synthetic data sets
was thus 3 × 4,096 = 12,288. Figure 1 shows sample data sets with K = 6 clusters from each complexity class.
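The complexity quantification scheme amounts to a simple average, which the following Python sketch makes explicit. The function names are hypothetical, and assigning a value that falls exactly on a class boundary (0.25 or 0.5) to the lower class is an assumption of this example.

```python
def complexity_index(rand, vd, vi):
    """Average of (1 - RAND), VD, and VI, each assumed to lie in [0, 1]."""
    for value in (rand, vd, vi):
        if not 0.0 <= value <= 1.0:
            raise ValueError("normalized measures must lie in [0, 1]")
    return ((1.0 - rand) + vd + vi) / 3.0


def complexity_class(omega):
    """Map a complexity index to one of the three classes used in the text."""
    if omega <= 0.25:   # boundary handling is an assumption
        return "easy"
    if omega <= 0.5:
        return "moderate"
    return "difficult"


print(complexity_class(complexity_index(0.9, 0.1, 0.1)))  # small index: "easy"
```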



(a) Easy ($\Omega = 0.103$)    (b) Moderate ($\Omega = 0.369$)    (c) Difficult ($\Omega = 0.569$)

Figure 1: Synthetic data sets with K = 6 clusters 



3.2. Attribute Normalization 

In clustering tasks, normalization is a common preprocessing step that is necessary to prevent attributes with
large ranges from dominating the distance calculations and also to avoid numerical instabilities in the computations.
Two commonly used normalization schemes are linear scaling to unit range (min-max normalization) and linear
scaling to unit variance (z-score normalization). Several studies revealed that the former scheme is preferable to
the latter since the latter is likely to eliminate valuable between-cluster variation [63, 37]. As a result, we used
min-max normalization to map the attributes of each real data set to the [0, 1] interval. Note that attributes of the
synthetic data sets were already normalized by the cluster generation algorithm.
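Min-max normalization can be sketched in Python as follows. The function name is hypothetical, and mapping a constant attribute (zero range) to 0.0 is an assumption, since the text does not discuss that case.

```python
def min_max_normalize(points):
    """Linearly rescale every attribute to the [0, 1] interval."""
    lo = [min(col) for col in zip(*points)]
    hi = [max(col) for col in zip(*points)]
    # A constant attribute has zero range; it is mapped to 0.0 (assumption).
    return [tuple((x - l) / (h - l) if h > l else 0.0 for x, l, h in zip(p, lo, hi))
            for p in points]


points = [(1.0, 10.0), (3.0, 10.0), (5.0, 10.0)]
print(min_max_normalize(points))  # first attribute becomes 0.0, 0.5, 1.0
```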

3.3. Performance Criteria 

The performance of the IMs was measured using five effectiveness (quality) and two efficiency (speed) criteria: 

> Initial SSE: This is the SSE value calculated after the initialization phase, before the clustering phase. It gives
us a measure of the effectiveness of an IM by itself.

> Final SSE: This is the SSE value calculated after the clustering phase. It gives us a measure of the effectiveness
of an IM when its output is refined by k-means. Note that this is the objective function of the k-means
algorithm, i.e., (1).

> Normalized Rand (RAND) [64], van Dongen (VD) [65], and Variation of Information (VI) [66] criteria: These are
external validity measures that quantify the extent to which the clustering structure discovered by a clustering
algorithm matches some external structure, e.g., one specified by the given class labels [67]. In a recent
comprehensive study, these three measures were found to be the best among 16 external validity measures
[68]. Note that each of these normalized measures takes values from the [0, 1] interval.

> Number of Iterations: This is the number of iterations that k-means requires until reaching convergence when
initialized by a particular IM. It is an efficiency measure independent of programming language, implementation
style, compiler, and CPU architecture.

> CPU Time: This is the total CPU time taken by the initialization and clustering phases. This criterion is
reported only for the real data sets.
Table 1: Descriptions of Real Data Sets

ID  Data Set                            # Points (N)  # Attributes (D)  # Classes (K')
 1  Breast Cancer Wisconsin (Original)           683                 9               2
 2  Cloud Cover (DB1)                          1,024                10              8†
 3  Concrete Compressive Strength              1,030                 9              8†
 4  Corel Image Features                      68,040                25             16†
 5  Covertype                                581,012                10               7
 6  Ecoli                                        336                 7               8
 7  Steel Plates Faults                        1,941                27               7
 8  Glass Identification                         214                 9               6
 9  Heart Disease                                297                13               5
10  Ionosphere                                   351                34               2
11  ISOLET                                     7,797               617              26
12  Landsat Satellite (Statlog)                6,435                36               6
13  Letter Recognition                        20,000                16              26
14  MAGIC Gamma Telescope                     19,020                10               2
15  Multiple Features (Fourier)                2,000                76              10
16  MiniBooNE Particle Identification        130,064                50               2
17  Musk (Clean2)                              6,598               166               2
18  Optical Digits                             5,620                64              10
19  Page Blocks Classification                 5,473                10               5
20  Parkinsons                                 5,875                18             42†
21  Pen Digits                                10,992                16              10
22  Person Activity                          164,860                 3              11
23  Pima Indians Diabetes                        768                 8               2
24  Image Segmentation                         2,310                19               7
25  Shuttle (Statlog)                         58,000                 9               7
26  SPECTF Heart                                 267                44               2
27  Telugu Vowels [61]                           871                 3               6
28  Vehicle Silhouettes (Statlog)                846                18               4
29  Wall-Following Robot Navigation            5,456                24               4
30  Wine Quality                               6,497                11               7
31  World TSP [62]                         1,904,711                 2              7†
32  Yeast                                      1,484                 8              10

† Due to the unavailability of class labels, for data sets #2, #3, and #4, K' was chosen arbitrarily,
whereas for #20 and #31, it was determined based on domain knowledge.



All of the methods were implemented in the C language, compiled with the gcc v4.4.3 compiler, and executed on
an Intel Xeon E5520 2.26GHz machine. Time measurements were performed using the getrusage function, which
is capable of measuring CPU time to an accuracy of a microsecond. The MT19937 variant of the Mersenne Twister
algorithm was used to generate high-quality pseudorandom numbers [69].

The convergence of k-means was controlled by the disjunction of two criteria: the number of iterations reaches
a maximum of 100 or the relative improvement in SSE between two consecutive iterations drops below a threshold,
i.e., $(\mathrm{SSE}_{i-1} - \mathrm{SSE}_i)/\mathrm{SSE}_i < \epsilon$, where $\mathrm{SSE}_i$ denotes the SSE value at the end of the i-th ($i \in \{1, 2, \ldots, 100\}$) iteration.
The convergence threshold was set to $\epsilon = 10^{-6}$.
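The stopping rule described above can be written as a small predicate. This Python sketch is illustrative only: the function name and signature are assumptions, and the guard for a zero SSE is an addition of this example to avoid division by zero.

```python
def has_converged(prev_sse, curr_sse, iteration, eps=1e-6, max_iter=100):
    """Stop when the iteration cap is reached or the relative SSE improvement is below eps."""
    if iteration >= max_iter:
        return True
    if curr_sse == 0.0:  # perfect fit; guard added here to avoid dividing by zero
        return True
    return (prev_sse - curr_sse) / curr_sse < eps


print(has_converged(100.0, 99.0, 5))    # False: a 1% improvement is still large
print(has_converged(100.0, 100.0, 5))   # True: no improvement
print(has_converged(100.0, 50.0, 100))  # True: iteration cap reached
```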

4. Experimental Results and Discussion 

In this study, we focus on IMs that have time complexity linear in N. This is because k-means itself has linear 
complexity, which is perhaps the most important reason for its popularity. Therefore, an IM for k-means should 
not diminish this advantage of the algorithm. Eight commonly used, order-invariant IMs were included in the 
experiments: Forgy's method (F), MacQueen's second method (M), maximin (X), Bradley and Fayyad's method 
(B) with J = 10, k-means++ (K), greedy k-means++ (G), Var-Part (V), and PCA-Part (P). It should be noted 
that among these methods only V and P are deterministic. 
In the experiments, each non-deterministic method was executed 100 times and statistics such as minimum,
mean, and standard deviation were collected for the effectiveness criteria. In each run, the number of clusters (K)
was set equal to the number of classes (K'), as commonly seen in the related literature [?, 35, 22, 36, 37, 45, 50].



Tables [2] and [3] give the Final SSE and CPU time (in milliseconds) results for the real data sets, respectively.
Note that, due to space limitations, only mean values are reported for the CPU time criterion. In order to determine
if there are any statistically significant differences among the methods, we employed two non-parametric statistical
tests [70]: the Friedman test [71] and the Iman & Davenport test [72]. These tests are alternatives to the parametric
two-way analysis of variance (ANOVA) test. Their advantage over ANOVA is that they do not require normality
or homoscedasticity, assumptions that are often violated in machine learning studies [73, 74].

Given B blocks (subjects) and T treatments (measurements), the null hypothesis ($H_0$) of the Friedman test is
that populations within a block are identical. The alternative hypothesis ($H_1$) is that at least one treatment tends
to yield larger (or smaller) values than at least one other treatment. The test statistic is calculated as follows [75].
In the first step, the observations within each block are ranked separately, so each block contains a separate set of
T ranks. If ties occur, the tied observations are given the mean of the rank positions for which they are tied. If
$H_0$ is true, the ranks in each block should be randomly distributed over the columns (treatments). Otherwise, we
expect a lack of randomness in this distribution. For example, if a particular treatment is better than the others, we
expect large (or small) ranks to 'favor' that column. In the second step, the ranks in each column are summed. If
$H_0$ is true, we expect the sums to be fairly close, so close that we can attribute differences to chance. Otherwise,
we expect to see at least one difference between pairs of rank sums so large that we cannot reasonably attribute it
to sampling variability. The test statistic is given as:

$$\chi_r^2 = \frac{12}{BT(T+1)} \sum_{j=1}^{T} R_j^2 - 3B(T+1) \qquad (3)$$

where $R_j$ ($j \in \{1, 2, \ldots, T\}$) is the rank sum of the j-th column. $\chi_r^2$ is approximately chi-square with $T - 1$ degrees
of freedom. $H_0$ is rejected at the $\alpha$ level of significance if the value of (3) is greater than or equal to the critical
chi-square value for $T - 1$ degrees of freedom. Iman and Davenport [72] proposed the following statistic:

$$F_r = \frac{(B-1)\chi_r^2}{B(T-1) - \chi_r^2} \qquad (4)$$

which is distributed according to the F-distribution with $T - 1$ and $(T - 1)(B - 1)$ degrees of freedom. When
compared to $\chi_r^2$, this statistic is not only less conservative, but also more accurate for small sample sizes [72].
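
The two-step ranking procedure and the two statistics can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the code used in this study: the function names are hypothetical, smaller observations are assumed to receive smaller ranks, the Friedman statistic is computed exactly as in (3) with no additional correction for ties, and critical values are not looked up.

```python
def average_ranks(values):
    """Rank one block; tied observations receive the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of observations tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0  # positions are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistics(table):
    """table[b][t] is the observation for block b and treatment t.

    Returns (rank sums R_j, Friedman chi-square, Iman-Davenport F).
    """
    B, T = len(table), len(table[0])
    rank_sums = [0.0] * T
    for row in table:  # step 1: rank within each block
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r  # step 2: sum the ranks per column
    chi2 = 12.0 / (B * T * (T + 1)) * sum(R * R for R in rank_sums) - 3.0 * B * (T + 1)
    denom = B * (T - 1) - chi2
    f_stat = float("inf") if denom == 0 else (B - 1) * chi2 / denom
    return rank_sums, chi2, f_stat
```

With four blocks and three treatments, `friedman_statistics([[1, 2, 3], [1, 3, 2], [2, 1, 3], [1, 2, 3]])` gives rank sums of 5, 8, and 11, so the chi-square statistic is 0.25 × 210 − 48 = 4.5 and the Iman-Davenport statistic is 13.5 / 3.5.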

In this study, blocks and treatments correspond to data sets and initialization methods, respectively. Our goal is
to determine whether or not there is at least one method that is significantly better than at least one other method
at the $\alpha = 0.05$ level. If this is the case, we will conduct a post-hoc (multiple comparison) test to determine which
pairs of methods differ significantly. For this purpose, we will use the Bergmann-Hommel test [76], a powerful
post-hoc procedure that has been used successfully in various machine learning studies.



4.1. Real Data Sets

Table [4] gives the Final SSE rankings of the IMs for the real data sets as determined by the Bergmann-Hommel
procedure using data given in Table [2]. Here, a notation such as C < {D, E} indicates that there is no statistically
significant difference between methods D and E and these two methods are significantly better than method C.
From Table [4] it can be seen that the methods cannot be distinguished from each other reliably. This was expected
since even nonparametric post-hoc tests lack discrimination power in small sample cases (recall that only 32 data
sets were used) with a large number of ties (see Table [2]). For example, with respect to the minimum statistic, the
performances of F, M, B, K, and G are statistically indistinguishable. In other words, if we initialize k-means with
each of these non-deterministic methods and execute it until convergence, the resulting clusterings over R = 100
runs will have very similar minimum Final SSE values. Similar trends were observed for the RAND, VD, and VI criteria
(data not shown). Given the abundance of local minima even in data sets of moderate size and/or dimensionality
and the gradient descent nature of k-means, it is not surprising that the deterministic methods V and P are
outperformed by most of the non-deterministic methods as the former methods were executed only once, whereas
the latter ones were executed R = 100 times.

As mentioned earlier, the minimum statistic is meaningful only when it is practical to execute k-means multiple 
times. Otherwise, the mean statistic is more meaningful. The analysis of mean Final SSE results using the 
Bergmann-Hommel procedure reveals that deterministic methods V and P are the preferred choices in this case.
This is not surprising since non-deterministic methods, in particular those that are ad hoc in nature, often produce
highly variable results across multiple runs.
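
The difference between the minimum and mean statistics can be made concrete with a toy experiment. The following Python sketch runs a one-dimensional k-means R times from random initial centers drawn from the data points and reports the minimum, mean, and standard deviation of the Final SSE. It is only an illustration: the helper names, the one-dimensional setting, and the use of the population standard deviation are assumptions made here, not details of the experiments in this study.

```python
import random
import statistics


def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)


def kmeans_1d(points, centers, max_iter=100, eps=1e-6):
    """Batch k-means on scalars; returns the Final SSE."""
    centers = list(centers)
    prev = sse(points, centers)
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: (p - centers[j]) ** 2)
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center (a simplifying assumption).
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
        curr = sse(points, centers)
        if curr == 0.0 or (prev - curr) / curr < eps:
            return curr
        prev = curr
    return prev


def run_statistics(points, k, runs=100, seed=0):
    """Minimum, mean, and standard deviation of Final SSE over `runs` random starts."""
    rng = random.Random(seed)  # seeded so that the experiment is repeatable
    finals = [kmeans_1d(points, rng.sample(points, k)) for _ in range(runs)]
    return min(finals), statistics.mean(finals), statistics.pstdev(finals)
```

A call such as `run_statistics([0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2], k=3)` returns the three statistics for one toy data set. The minimum only matters if all R runs are affordable, whereas the mean and standard deviation describe what a single run is likely to deliver.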

The standard deviation statistic characterizes the reliability of a non-deterministic IM with respect to a particular 
performance criterion. In other words, if a non-deterministic IM obtains low mean and standard deviation with 
respect to an effectiveness criterion, we do not have to execute this method a large number of times to obtain good 
results. The analysis of Final SSE standard deviations reveals two overlapping groups of methods. Once again this 
is not necessarily because the members of each group are in fact indistinguishable with respect to their reliability, 
but due to the relatively small sample size used. In summary, due to the necessarily small number of real-world 
data sets available for clustering studies, it may not be possible to distinguish among various IMs. Therefore, it is 
crucial that these IMs be tested on a large number of synthetic data sets (see §4.2).

As for computational efficiency, it can be seen from Table [3] that, in general, the IMs have similar computational 
requirements per run. However, in practice, a non-deterministic method is typically executed R times and the 
output of the run that gives the least SSE is taken as the result. Therefore, the total computational cost of a 
non-deterministic method is often significantly higher than that of a deterministic method. As predicted in §2.5,
simple methods such as M require about the same CPU time as elaborate methods such as G. This is because
simple methods often lead to more k-means iterations, whereas elaborate ones compensate for their computational
overhead by requiring fewer k-means iterations. It should be noted that efficiency differences among the methods
can be further reduced by using faster k-means variants.

4.2. Synthetic Data Sets

Table [5] gives the ranking of the IMs with respect to the minimum statistic. It can be seen that despite variations
in rankings across the performance criteria, some general trends emerge: 

> Non-deterministic methods outperform the deterministic ones, i.e., V and P, except in the case of Initial SSE. 
As explained in §4.1, this is due to the fact that the non-deterministic methods were executed R = 100 times,
whereas the deterministic ones were executed only once. Deterministic methods have good
Initial SSE performance because they are approximate (divisive hierarchical) clustering methods
by themselves and thus give reasonable results even without k-means refinement.

> Method B consistently appears in the best performing group, whereas methods F and X are often among the 
worst non-deterministic methods. 

> Method M exhibits moderate-to-good performance except in the case of Initial SSE. Recall that this method 
randomly selects the K initial centers from among the data points and therefore it cannot be expected to 
perform well without k-means refinement. 

> Methods K and G generally perform well. In some cases the latter outperforms the former, whereas in others 
they have comparable performance. 

Table [6] gives the ranking of the IMs with respect to the mean statistic. It can be seen that despite variations
in rankings across the performance criteria, some general trends emerge:

> Deterministic methods, i.e., V and P, generally outperform the non-deterministic ones. As explained in §4.1,
this is due to the fact that the non-deterministic methods can produce highly variable results across multiple 
runs. Method B is highly competitive with the deterministic methods. 

> Methods M and X are often among the worst performers, whereas methods F and K exhibit moderate-to-bad 
performance. 

> Method G is often significantly better than all non-deterministic methods but B. 

Table [7] gives the ranking of the non-deterministic IMs with respect to the standard deviation statistic. It can 
be seen that despite variations in rankings across the performance criteria, some general trends emerge: 

> Method B consistently appears in the best performing group, whereas method M is often among the worst 
performers.

> Methods X and K exhibit moderate-to-bad performance.

> Methods F and G are significantly better than all methods but B.

4.3. Recommendations for Practitioners

Based on the statistical analyses presented in the previous section, the following recommendations can be made: 

> In general, methods F, M, and X should not be used. These methods are easy to understand and implement, 
but they are often ineffective and unreliable. Furthermore, despite their low overhead, these methods do not 
offer significant time savings since they often result in slower k-means convergence.

> In time-critical applications that involve large data sets or applications that demand determinism, methods
V or P should be used. These methods need to be executed only once and they lead to very fast k-means
convergence. The efficiency difference between the two is noticeable only on high dimensional data sets. This
is because method V calculates the direction of split by determining the coordinate axis with the greatest
variance (in $O(D)$ time), whereas method P achieves this by calculating the principal eigenvector of the
covariance matrix (in $O(D^2)$ time using the power method [38]). Note that despite its higher computational
complexity, method P can, in some cases, be more efficient than method V (see Table [3]). This is because the
former converges significantly faster than the latter (see Table [6]). The main disadvantage of these methods
is that they are more complicated to implement due to their hierarchical formulation.

> In applications that involve small data sets, e.g., N < 10,000, methods B or G should be used. It is computa- 
tionally feasible to run these methods hundreds of times on such data sets given that one such run takes only 
a few milliseconds. 

> In applications where an approximate clustering of the data set is desired, methods B, G, V, or P should be 
used. These methods produce very good initial clusterings (see Tables [5] and [6]), which makes it possible to
use them as standalone clustering algorithms. 
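
To illustrate the difference between the split directions used by methods V and P, the following Python sketch computes both for a set of points: the coordinate axis with the greatest variance, and the principal eigenvector of the covariance matrix obtained by power iteration. This is a simplified sketch and not the full Var-Part or PCA-Part procedure; the function names, the fixed number of power iterations, and the all-ones starting vector are assumptions made for illustration.

```python
import math


def max_variance_axis(points):
    """Index of the coordinate axis with the greatest variance (direction used by V)."""
    n, d = len(points), len(points[0])
    best_axis, best_var = 0, -1.0
    for k in range(d):
        mean_k = sum(p[k] for p in points) / n
        var_k = sum((p[k] - mean_k) ** 2 for p in points) / n
        if var_k > best_var:
            best_axis, best_var = k, var_k
    return best_axis


def covariance_matrix(points):
    """D x D covariance matrix of the points."""
    n, d = len(points), len(points[0])
    means = [sum(p[k] for p in points) / n for k in range(d)]
    return [[sum((p[a] - means[a]) * (p[b] - means[b]) for p in points) / n
             for b in range(d)] for a in range(d)]


def principal_direction(points, iterations=100):
    """Unit principal eigenvector via power iteration (direction used by P)."""
    cov = covariance_matrix(points)
    d = len(cov)
    # Assumption: the all-ones start is not orthogonal to the principal eigenvector.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        # One matrix-vector product costs O(D^2) operations.
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v
```

For points spread along a diagonal, such as `[(0, 0), (1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9)]`, the first function returns axis 0, while the second returns a direction close to the diagonal, which no single coordinate axis can represent.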

5. Conclusions 

In this paper we presented an overview of k-means initialization methods with an emphasis on their computa- 
tional efficiency. We then compared eight commonly used linear time initialization methods on a large and diverse 
collection of real and synthetic data sets using various performance criteria. Finally, we analyzed the experimental 
results using non-parametric statistical tests and provided recommendations for practitioners. Our statistical analyses
revealed that popular initialization methods such as Forgy's method, MacQueen's second method, and maximin often perform poorly and
that there are significantly better alternatives to these methods that have comparable computational requirements.

Table 2: Final SSE (Real Data Sets)







 #  Stat  F            M            X            B            K            G            V            P
 1  min   239          239          239          239          239          239          239          239
    mean  239±0        239±0        239±0        239±0        239±0        239±0        239±0        239±0
 2  min   38           38           41           38           38           38           39           38
    mean  44±4         40±1         45±3         39±1         39±1         39±1         39±0         38±0
 3  min   167          167          167          167          167          167          173          167
    mean  176±8        176±7        172±6        171±4        174±5        173±4        173±0        167±0
 4  min   10057        10057        10057        10057        10058        10058        10101        10100
    mean  10115±79     10080±20     10077±15     10076±18     10083±23     10080±18     10101±0      10100±0
 5  min   66224        66224        66224        66224        66224        66224        66238        66238
    mean  66990±890    67196±1048   67350±834    66431±360    66948±876    66930±773    66238±0      66238±0
 6  min   17           17           19           17           17           17           17           18
    mean  19±2         19±3         20±1         18±1         18±1         18±1         17±0         18±0
 7  min   1167         1167         1267         1167         1167         1167         1167         1168
    mean  1231±83      1250±103     1303±53      1184±25      1230±61      1198±35      1167±0       1168±0
 8  min   18           18           19           18           18           18           19           19
    mean  20±1         20±2         22±2         20±1         20±2         20±1         19±0         19±0
 9  min   243          243          243          243          243          243          248          243
    mean  251±8        252±8        253±8        251±7        252±8        249±7        248±0        243±0
10  min   629          629          629          629          629          629          629          629
    mean  629±0        633±28       671±81       637±39       635±34       635±35       629±0        629±0
11  min   117891       117924       120898       117863       117719       117995       118495       118386
    mean  119931±1060  119655±1061  123388±1264  119050±699   119538±894   119176±710   118495±0     118386±0
12  min   1742         1742         1742         1742         1742         1742         1742         1742
    mean  1742±0       1742±0       1742±0       1742±0       1744±28      1747±41      1742±0       1742±0
13  min   2723         2718         2721         2719         2718         2715         2735         2745
    mean  2775±28      2756±22      2765±17      2742±15      2754±18      2752±17      2735±0       2745±0
14  min   2923         2923         2923         2923         2923         2923         2923         2923
    mean  2923±0       2923±0       2923±0       2923±0       2923±0       2923±0       2923±0       2923±0
15  min   3127         3128         3180         3127         3128         3127         3137         3214
    mean  3164±30      3168±28      3247±22      3157±29      3173±33      3149±20      3137±0       3214±0
16  min   2802         2802         2802         2802         2802         2802         21983        2802
    mean  8229±8685    12518±9667   2802±0       11935±9656   5722±6944    3774±4236    21983±0      2802±0
17  min   36373        36373        36373        36373        36373        36373        36373        36373
    mean  37755±2829   37046±916    36738±754    37152±1340   37440±1906   37103±1639   36373±0      36373±0
18  min   14559        14559        14559        14559        14559        14559        14581        14807
    mean  14653±140    14763±273    14774±293    14627±66     14735±234    14719±214    14581±0      14807±0
19  min   215          215          230          215          215          215          227          215
    mean  217±4        217±4        254±32       219±6        219±10       217±4        227±0        215±0
20  min   235          219          233          218          217          217          220          219
    mean  251±7        224±2        241±3        222±2        220±2        219±1        220±0        219±0
21  min   4930         4930         4930         4930         4930         4930         4930         5004
    mean  5130±131     5091±110     5036±106     5012±70      5111±116     5046±75      4930±0       5004±0
22  min   1177         1177         1195         1177         1177         1177         1182         1177
    mean  1179±10      1187±18      1204±25      1182±12      1193±27      1183±14      1182±0       1177±0
23  min   121          121          121          121          121          121          121          121
    mean  121±2        122±5        122±3        122±3        122±5        122±5        121±0        121±0
24  min   387          387          411          387          387          387          410          405
    mean  407±23       414±20       430±21       402±16       410±19       402±13       410±0        405±0
25  min   235          235          411          235          235          235          235          274
    mean  307±39       275±23       930±105      244±18       271±39       246±21       235±0        274±0
26  min   214          214          214          214          214          214          214          214
    mean  214±0        214±0        214±0        214±0        214±0        214±0        214±0        214±0
27  min   22           22           22           22           22           22           23           23
    mean  23±2         23±1         23±0         23±1         23±1         23±0         23±0         23±0
28  min   223          223          224          223          223          223          224          224
    mean  224±2        226±4        237±1        228±6        226±5        225±3        224±0        224±0
29  min   7772         7772         7772         7772         7772         7772         7774         7774
    mean  7798±91      7808±102     7854±160     7773±1       7831±140     7811±106     7774±0       7774±0
30  min   334          334          348          334          334          334          335          334
    mean  335±2        336±2        374±17       337±5        336±3        336±3        335±0        334±0
31  min   11039        11039        11039        11039        11039        11039        11483        12422
    mean                            11714±627    11128±231    11773±872                 11483±0      12422±0
32  min   58           58           61           58           58           58           69           59
    mean  64±5         70±6         61±1         66±6         63±5         59±1         69±0         59±0










Table 3: CPU Time (Real Data Sets) 












 #  F       M       X       B       K       G       V       P
 1
 2  3       3       2       4       3       2       10
 3  2       2       2       4       2       2               10
 4  2295    2248    2173    3624    2332    2459    1900    2540
 5  2183    2229    2714    3604    2273    2274    1730    2120
 6                          1
 7  8       8       7       12      9       9               20
 8                          1
 9          1               1               1
10                          1
11  3730    3469    2469    4063    3537    3915    6940    12200
12  28      32      40      40      34      32      40      50
13  700     693     729     852     698     693     950     800
14  20      19      28      30      21      19      30      20
15  54      56      62      70      57      58      30      70
16  252     283     58      417     112     96      230     570
17  22      20      24      34      25      26      20      220
18  112     116     131     137     121     125     60      140
19  9       10      7       12      8       10      10      10
20  109     100     110     150     97      118     60      70
21  59      53      52      67      56      59      30      50
22  430     524     314     718     469     513     730     480
23  1       1       1       1       1
24  6       5       7       8       6       7       10
25  84      82      19      122     79      81      100     80
26                          1                       10
27  1       1               1       1       1
28  3       2       1       3       2       3       10      10
29  20      20      20      28      19      21      10      20
30  38      38      24      51      36      38      60      40
31  748     949     1377    2439    918     1044    580     840
32  5       6       5       9       6       6









Table 4: Final SSE Rankings (Real Data Sets) 



Statistic           IM Ranking
Minimum             {X, V, P} < {F, M, B, K, G}
Mean                {F, M, X, K} < {F, B, G} < {G, V} < {V, P}
Standard Deviation  {F, M, X, B, K} < {X, B, G}


Table 5: Minimum (Synthetic Data Sets)

Criterion             Data Set Complexity  IM Ranking
Initial SSE           Easy                 X < {F, M} < K < G < V < P < B
                      Moderate             X < {F, M, K} < G < V < P < B
                      Difficult            X < {M, K} < F < G < V < P < B
Final SSE             Easy                 {V, P} < {F, X} < {M, B, K, G}
                      Moderate             {V, P} < {F, X} < {F, M, B, K} < {M, B, K, G}
                      Difficult            V < P < {F, X} < {M, X, B, K, G}
Final RAND            Easy                 {V, P} < X < F < {M, K} < G < B
                      Moderate             {V, P} < X < {F, M, K} < {M, K, G} < {B, G}
                      Difficult            V < P < {F, X} < {F, K, G} < {M, K, G} < {M, B, K}
Final VD              Easy                 {V, P} < X < {F, M} < {K, G} < B
                      Moderate             {V, P} < X < {F, M} < {M, K} < {B, K, G}
                      Difficult            V < P < {F, X, G} < {M, K, G} < {M, B, K}
Final VI              Easy                 {V, P} < X < {F, M, K, G} < B
                      Moderate             {V, P} < X < F < {M, B, K, G}
                      Difficult            V < P < {F, X} < {X, G} < {M, B, K, G}
Number of Iterations  Easy                 V < F < M < {X, K} < P < G < B
                      Moderate             V < P < F < M < K < {X, G} < B
                      Difficult            V < P < F < M < K < G < X < B

Table 6: Mean (Synthetic Data Sets)



Criterion             Data Set Complexity  IM Ranking
Initial SSE           Easy                 X < M < K < F < G < V < {B, P}
                      Moderate             X < {M, K} < F < G < V < {B, P}
                      Difficult            X < K < M < F < G < V < B < P
Final SSE             Easy                 {M, X} < K < F < G < B < {V, P}
                      Moderate             X < {M, K} < F < G < B < V < P
                      Difficult            F < {M, K} < X < G < B < V < P
Final RAND            Easy                 {M, X} < K < F < G < B < {V, P}
                      Moderate             X < {M, K} < {F, G} < B < {V, P}
                      Difficult            {F, K} < {X, K, G} < M < V < B < P
Final VD              Easy                 {M, X} < K < F < G < B < {V, P}
                      Moderate             X < {M, K} < F < G < B < V < P
                      Difficult            F < {M, X, K, G} < V < B < P
Final VI              Easy                 {M, X} < K < F < G < B < {V, P}
                      Moderate             {M, X, K} < F < G < B < V < P
                      Difficult            F < {M, K, G} < {X, G} < V < B < P
Number of Iterations  Easy                 M < X < K < F < G < V < P < B
                      Moderate             {F, M} < {X, K} < G < V < P < B
                      Difficult            F < M < K < G < X < V < P < B


Table 7: Standard Deviation (Synthetic Data Sets)

Criterion             Data Set Complexity  IM Ranking
Initial SSE           Easy                 X < M < K < G < F < B
                      Moderate             Same as Easy
                      Difficult            X < M < K < G < {F, B}
Final SSE             Easy                 M < {X, K} < G < F < B
                      Moderate             {M, K} < X < {F, G} < B
                      Difficult            {M, K} < {F, X, G} < B
Final RAND            Easy                 M < {X, K} < G < F < B
                      Moderate             {M, X, K} < {F, G} < B
                      Difficult            {M, K} < {F, X, G} < B
Final VD              Easy                 M < {X, K} < G < F < B
                      Moderate             {M, K} < X < {F, G} < B
                      Difficult            {M, K} < {K, G} < {F, G} < X < B
Final VI              Easy                 M < {X, K} < {F, G} < B
                      Moderate             {M, K} < X < {F, G} < B
                      Difficult            {F, M, K, G} < X < B
Number of Iterations  Easy                 M < K < {X, G} < F < B
                      Moderate             {M, K} < X < {F, G} < B
                      Difficult            {F, M, X, K, G} < B



